Flexing in 73 Languages: A Single Small Model for Multilingual Inflection
Sourada, Tomáš, Straková, Jana
We present a compact, single-model approach to multilingual inflection, the task of generating inflected word forms from base lemmas to express grammatical categories. Our model, trained jointly on data from 73 languages, is lightweight, robust to unseen words, and outperforms monolingual baselines in most languages. This demonstrates the effectiveness of multilingual modeling for inflection and highlights its practical benefits: simplifying deployment by eliminating the need to manage and retrain dozens of separate monolingual models. In addition to the standard SIGMORPHON shared task benchmarks, we evaluate our monolingual and multilingual models on 73 Universal Dependencies (UD) treebanks, extracting lemma-tag-form triples and their frequency counts. To ensure realistic data splits, we introduce a novel frequency-weighted, lemma-disjoint train-dev-test resampling procedure. Our work addresses the lack of an open-source, general-purpose, multilingual morphological inflection system capable of handling unseen words across a wide range of languages, including Czech.
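The frequency-weighted, lemma-disjoint resampling described in the abstract can be sketched as follows. This is a minimal illustration under assumed data shapes (4-tuples carrying frequency counts), not the authors' released code:

```python
import random
from collections import defaultdict

def lemma_disjoint_split(records, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split (lemma, tag, form, count) records so that no lemma appears in
    more than one split, while each split's share of total token frequency
    approximates the requested ratios."""
    by_lemma = defaultdict(list)
    for rec in records:
        by_lemma[rec[0]].append(rec)
    lemmas = sorted(by_lemma)
    random.Random(seed).shuffle(lemmas)
    total = sum(rec[3] for rec in records)
    budgets = [r * total for r in ratios]
    splits = [[] for _ in ratios]
    masses = [0.0] * len(ratios)
    for lemma in lemmas:
        mass = sum(rec[3] for rec in by_lemma[lemma])
        # greedily place the whole lemma in the split furthest below its
        # frequency budget, so no lemma is shared across splits
        i = max(range(len(ratios)), key=lambda j: budgets[j] - masses[j])
        splits[i].extend(by_lemma[lemma])
        masses[i] += mass
    return splits  # [train, dev, test]
```

Because whole lemmas are assigned at once, the test split contains only unseen lemmas, which matches the paper's goal of realistic OOV evaluation.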
Can a Neural Model Guide Fieldwork? A Case Study on Morphological Data Collection
Mahmudi, Aso, Herce, Borja, Amestica, Demian Inostroza, Scherbakov, Andreas, Hovy, Eduard, Vylomova, Ekaterina
Linguistic fieldwork is an important component of language documentation and preservation. However, it is a long, exhausting, and time-consuming process. This paper presents a novel model that guides a linguist during fieldwork and accounts for the dynamics of linguist-speaker interactions. We introduce a novel framework that evaluates the efficiency of various sampling strategies for obtaining morphological data and assesses the effectiveness of state-of-the-art neural models in generalising morphological structures. Our experiments highlight two key strategies for improving efficiency: (1) increasing the diversity of annotated data by sampling uniformly among the cells of paradigm tables, and (2) using model confidence as a guide to enhance positive interaction by providing reliable predictions during annotation.
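Strategy (1), uniform sampling over paradigm cells, can be sketched like this; the data layout is hypothetical, not the paper's code:

```python
import random

def uniform_cell_sample(paradigms, k, seed=0):
    """Pick k (lemma, cell) pairs uniformly at random across all paradigm
    cells, so that rare cells get annotated as often as frequent ones."""
    rng = random.Random(seed)
    cells = [(lemma, cell)
             for lemma, table in sorted(paradigms.items())
             for cell in table]
    return rng.sample(cells, k)  # sampling without replacement
```

Uniform sampling contrasts with frequency-based elicitation, which tends to over-collect common cells and leave paradigm gaps.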
Why do language models perform worse for morphologically complex languages?
Arnett, Catherine, Bergen, Benjamin K.
Language models perform differently across languages. It has been previously suggested that morphological typology may explain some of this variability (Cotterell et al., 2018). We replicate previous analyses and find new evidence for a performance gap between agglutinative and fusional languages, where fusional languages, such as English, tend to have better language modeling performance than morphologically more complex languages like Turkish. We then propose and test three possible causes for this performance gap: morphological alignment of tokenizers, tokenization quality, and disparities in dataset sizes and measurement. To test the morphological alignment hypothesis, we present MorphScore, a tokenizer evaluation metric, and supporting datasets for 22 languages. We find some evidence that tokenization quality explains the performance gap, but none for the role of morphological alignment. Instead, we find that the performance gap is most reduced when training datasets are of equivalent size across language types, but only when scaled according to the so-called "byte-premium" -- the different encoding efficiencies of different languages and orthographies. These results suggest that no language is harder or easier for a language model to learn on the basis of its morphological typology. Differences in performance can be attributed to disparities in dataset size. These results bear on ongoing efforts to improve performance for low-performing and under-resourced languages.
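The "byte premium" referred to above reflects how many bytes a language's orthography needs to encode comparable content. A toy illustration follows; the paper's actual measure is estimated over parallel corpora, not single strings:

```python
def byte_premium(parallel_text, english_text):
    """Ratio of UTF-8 bytes a language uses vs. English for parallel
    content: Latin-script ASCII is 1 byte/char, Cyrillic is 2, many
    Asian scripts are 3, so equal byte budgets buy unequal content."""
    return len(parallel_text.encode("utf-8")) / len(english_text.encode("utf-8"))
```

Equalizing training data by raw bytes therefore undersupplies high-premium languages unless dataset sizes are scaled by this ratio.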
OOVs in the Spotlight: How to Inflect them?
Sourada, Tomáš, Straková, Jana, Rosa, Rudolf
We focus on morphological inflection in out-of-vocabulary (OOV) conditions, an under-researched subtask in which state-of-the-art systems are usually less effective. We developed three systems: a retrograde model and two sequence-to-sequence (seq2seq) models based on LSTM and Transformer. For testing in OOV conditions, we automatically extracted a large dataset of nouns in the morphologically rich Czech language, with lemma-disjoint data splits, and we further manually annotated a real-world OOV dataset of neologisms. In the standard OOV conditions, Transformer achieves the best results, with performance further improved by ensembling with the LSTM, the retrograde model, and SIGMORPHON baselines. On the real-world OOV dataset of neologisms, the retrograde model outperforms all neural models. Finally, our seq2seq models achieve state-of-the-art results in 9 out of 16 languages from SIGMORPHON 2022 shared task data in the OOV evaluation (feature overlap) in the large data condition. We release the Czech OOV Inflection Dataset for rigorous evaluation in OOV conditions. Further, we release the inflection system with the seq2seq models as a ready-to-use Python library.
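A retrograde model of the kind mentioned above can be approximated by indexing suffix-rewrite rules under lemma endings and applying the longest match. The following is a hedged sketch of that idea, not the authors' released library:

```python
from collections import Counter, defaultdict

def learn_rules(triples, max_ctx=4):
    """Collect, for each (lemma ending, tag), the most frequent suffix
    rewrite observed in (lemma, tag, form) training triples."""
    votes = defaultdict(Counter)
    for lemma, tag, form in triples:
        i = 0  # longest common prefix of lemma and form
        while i < min(len(lemma), len(form)) and lemma[i] == form[i]:
            i += 1
        rewrite = (lemma[i:], form[i:])  # e.g. ("", "ed") for walk -> walked
        for n in range(1, max_ctx + 1):
            votes[(lemma[-n:], tag)][rewrite] += 1
    return {key: cnt.most_common(1)[0][0] for key, cnt in votes.items()}

def inflect(lemma, tag, rules, max_ctx=4):
    """Apply the rewrite stored under the longest known lemma ending,
    which generalizes to OOV lemmas sharing that ending."""
    for n in range(max_ctx, 0, -1):
        rule = rules.get((lemma[-n:], tag))
        if rule is not None:
            cut, add = rule
            return lemma[: len(lemma) - len(cut)] + add
    return lemma  # no known ending: fall back to the bare lemma
```

Because inflection in many languages is governed by word endings, this ending-indexed lookup is robust to neologisms, consistent with the abstract's finding that the retrograde model wins on real-world OOV data.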
J-UniMorph: Japanese Morphological Annotation through the Universal Feature Schema
Matsuzaki, Kosuke, Taniguchi, Masaya, Inui, Kentaro, Sakaguchi, Keisuke
We introduce a Japanese Morphology dataset, J-UniMorph, developed based on the UniMorph feature schema. This dataset addresses the unique and rich verb forms characteristic of the language's agglutinative nature. J-UniMorph distinguishes itself from the existing Japanese subset of UniMorph, which is automatically extracted from Wiktionary. On average, the Wiktionary Edition features around 12 inflected forms for each word and is primarily dominated by denominal verbs (i.e., [noun] +suru (do-PRS)). Morphologically, this form is equivalent to the verb suru (do). In contrast, J-UniMorph explores a much broader and more frequently used range of verb forms, offering 118 inflected forms for each word on average. It includes honorifics, a range of politeness levels, and other linguistic nuances, emphasizing the distinctive characteristics of the Japanese language. This paper presents detailed statistics and characteristics of J-UniMorph, comparing it with the Wiktionary Edition. We make J-UniMorph and its interactive visualizer publicly available, aiming to support cross-linguistic research and various applications.
Exploring Linguistic Probes for Morphological Generalization
Kodner, Jordan, Khalifa, Salam, Payne, Sarah
The standard setup, in the SIGMORPHON and SIGMORPHON-UniMorph shared tasks (Cotterell et al., 2016, 2017, 2018; McCarthy et al., 2019; Vylomova et al., 2020; Pimentel et al., 2021; Kodner et al., 2022) as well as in more targeted studies focused on specific languages or the generalization behavior of computational models (Goldman et al., 2022; Wiemerslage et al., 2022; Kodner et al., 2023b; Guriel et al., 2023; Kodner et al., 2023a), is to train on (lemma, inflection, features) triples and predict inflected forms from held-out (lemma, features) pairs.

Three languages were chosen whose inflectional morphologies range from entirely fusional (English), to mixed (Spanish), to mostly agglutinative (Swahili). In highly agglutinative languages, individual features in a set tend to correspond to distinct morphological patterns, so a model may generalize to unseen feature sets by mapping component features to their corresponding patterns. This is exemplified by the Swahili example (1), in which most features correspond to individual morphemes; only the person/number prefix maps to more than
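The train-on-triples, predict-on-pairs setup described above can be made concrete with a toy data format; the tag strings and helper function here are illustrative, not from the paper:

```python
# UniMorph-style training triples: (lemma, feature bundle, inflected form)
train = [
    ("walk", "V;PST", "walked"),
    ("walk", "V;V.PTCP;PRS", "walking"),
    ("jump", "V;PST", "jumped"),
]
# held-out (lemma, features) pairs: the form must be predicted
heldout = [("jump", "V;V.PTCP;PRS")]

def to_seq2seq_input(lemma, feats):
    """A common seq2seq encoding: feature tags prepended to the lemma's
    characters, so the decoder can condition on both."""
    return feats.split(";") + list(lemma)
```

Under this setup, generalization to a held-out feature bundle requires the model to recombine patterns learned from other (lemma, features) combinations.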
NusaCrowd: Open Source Initiative for Indonesian NLP Resources
Cahyawijaya, Samuel, Lovenia, Holy, Aji, Alham Fikri, Winata, Genta Indra, Wilie, Bryan, Mahendra, Rahmad, Wibisono, Christian, Romadhony, Ade, Vincentio, Karissa, Koto, Fajri, Santoso, Jennifer, Moeljadi, David, Wirawan, Cahya, Hudi, Frederikus, Parmonangan, Ivan Halim, Alfina, Ika, Wicaksono, Muhammad Satrio, Putra, Ilham Firdausi, Rahmadani, Samsul, Oenang, Yulianti, Septiandri, Ali Akbar, Jaya, James, Dhole, Kaustubh D., Suryani, Arie Ardiyanti, Putri, Rifki Afina, Su, Dan, Stevens, Keith, Nityasya, Made Nindyatama, Adilazuarda, Muhammad Farid, Ignatius, Ryan, Diandaru, Ryandito, Yu, Tiezheng, Ghifari, Vito, Dai, Wenliang, Xu, Yan, Damapuspita, Dyah, Tho, Cuk, Karo, Ichwanul Muslim Karo, Fatyanosa, Tirana Noor, Ji, Ziwei, Fung, Pascale, Neubig, Graham, Baldwin, Timothy, Ruder, Sebastian, Sujaini, Herry, Sakti, Sakriani, Purwarianti, Ayu
We present NusaCrowd, a collaborative initiative to collect and unify existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have brought together 137 datasets and 118 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their value is demonstrated through multiple experiments. NusaCrowd's data collection enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and the local languages of Indonesia. Furthermore, NusaCrowd enables the creation of the first multilingual automatic speech recognition benchmark in Indonesian and the local languages of Indonesia. Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.
Morphological Inflection with Phonological Features
Guriel, David, Goldman, Omer, Tsarfaty, Reut
Recent years have brought great advances in solving morphological tasks, mostly due to powerful neural models applied to tasks such as (re)inflection and analysis. Yet, such morphological tasks cannot be considered solved, especially when little training data is available or when generalizing to previously unseen lemmas. This work explores effects on performance obtained through various ways in which morphological models get access to subcharacter phonological features that are the targets of morphological processes. We design two methods to achieve this goal: one that leaves models as is but manipulates the data to include features instead of characters, and another that manipulates models to take phonological features into account when building representations for phonemes. We elicit phonemic data from standard graphemic data using language-specific grammars for languages with shallow grapheme-to-phoneme mapping, and we experiment with two reinflection models over eight languages. Our results show that our methods yield comparable results to the grapheme-based baseline overall, with minor improvements in some of the languages. All in all, we conclude that patterns in character distributions are likely to allow models to infer the underlying phonological characteristics, even when phonemes are not explicitly represented.
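The data-side method described above (replacing characters with phonological features) can be sketched with a toy feature table; the inventory and feature values here are hypothetical and far smaller than a real phonological grammar:

```python
# toy phoneme -> distinctive-feature bundle (illustrative values only)
FEATURES = {
    "p": ("labial", "stop", "voiceless"),
    "b": ("labial", "stop", "voiced"),
    "a": ("vowel", "low", "back"),
}

def to_feature_sequence(word):
    """Replace each symbol by its feature bundle so the model sees the
    subcharacter targets of phonological processes; unknown symbols
    pass through as singleton bundles."""
    return [FEATURES.get(ch, (ch,)) for ch in word]
```

Because the model itself is untouched, this variant only changes the input/output alphabet, which is what lets the authors compare it directly against a grapheme-based baseline.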
Morphological Inflection: A Reality Check
Kodner, Jordan, Payne, Sarah, Khalifa, Salam, Liu, Zoey
Morphological inflection is a popular task in sub-word NLP with both practical and cognitive applications. For years now, state-of-the-art systems have reported high, but also highly variable, performance across data sets and languages. We investigate the causes of this high performance and high variability; we find several aspects of data set creation and evaluation which systematically inflate performance and obfuscate differences between languages. To improve generalizability and reliability of results, we propose new data sampling and evaluation strategies that better reflect likely use-cases. Using these new strategies, we make new observations on the generalization abilities of current inflection systems.
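One evaluation strategy in the spirit of the abstract is to report accuracy separately for lemmas seen and unseen in training, since seen-lemma overlap can inflate headline scores. A minimal sketch, with hypothetical names:

```python
def seen_unseen_accuracy(predictions, gold, test_lemmas, train_lemmas):
    """Accuracy over test items whose lemma was / was not seen in
    training, reported separately to expose generalization gaps."""
    train_set = set(train_lemmas)
    buckets = {"seen": [0, 0], "unseen": [0, 0]}  # [correct, total]
    for pred, ref, lemma in zip(predictions, gold, test_lemmas):
        b = buckets["seen" if lemma in train_set else "unseen"]
        b[0] += int(pred == ref)
        b[1] += 1
    return {k: (c / t if t else None) for k, (c, t) in buckets.items()}
```

A single aggregate accuracy would average these two regimes together, hiding exactly the variability the paper sets out to diagnose.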
K-UniMorph: Korean Universal Morphology and its Feature Schema
Jo, Eunkyul Leah, Kim, Kyuwon, Wu, Xihan, Lim, KyungTae, Park, Jungyeul, Park, Chulwoo
We present in this work a new Universal Morphology dataset for Korean. Previously, Korean has been underrepresented among the hundreds of diverse world languages covered by morphological paradigm resources. Hence, we propose Universal Morphology paradigms for the Korean language that preserve its distinct characteristics. For our K-UniMorph dataset, we outline each grammatical criterion in detail for the verbal endings, clarify how to extract inflected forms, and demonstrate how we generate the morphological schemata. This dataset adopts the morphological feature schema of Sylak-Glassman et al. (2015) and Sylak-Glassman (2016) for the Korean language, as we extract inflected verb forms from the Sejong morphologically analyzed corpus, one of the largest annotated corpora for Korean. During data creation, our methodology also includes investigating the correctness of the conversion from the Sejong corpus. Furthermore, we carry out the inflection task using three different Korean word forms: letters, syllables, and morphemes. Finally, we discuss and describe future perspectives on Korean morphological paradigms and the dataset.
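The letter-level ("letters", i.e. jamo) word form used in the inflection experiments relies on the fact that precomposed Hangul syllables decompose arithmetically in Unicode; a minimal sketch:

```python
V_COUNT, T_COUNT = 21, 28   # vowel and tail (final consonant) jamo counts
BASE = 0xAC00               # first precomposed syllable, '가'

def to_jamo_indices(syllable):
    """Decompose one Hangul syllable into (lead, vowel, tail) jamo indices
    using the Unicode Hangul composition formula."""
    code = ord(syllable) - BASE
    assert 0 <= code < 19 * V_COUNT * T_COUNT, "not a precomposed syllable"
    return (code // (V_COUNT * T_COUNT),
            (code % (V_COUNT * T_COUNT)) // T_COUNT,
            code % T_COUNT)
```

Working at the jamo level exposes the stem-final consonant alternations that Korean verbal inflection manipulates, which syllable-level units hide.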